Despite its importance for the propagation of rumors or innovations, peer-to-peer collaboration, social
networking or marketing, the dynamics of information spreading is not well understood. Since the
diffusion depends on the heterogeneous patterns of human behavior and is driven by the partici- 
pants' decisions, its propagation dynamics shows surprising properties not explained by traditional 
epidemic or contagion models. Here we present a detailed analysis of our study of real Viral Mar- 
keting campaigns where tracking the propagation of a controlled message allowed us to analyze 
the structure and dynamics of a diffusion graph involving over 31,000 individuals. We found that 
information spreading displays a non-Markovian branching dynamics that can be modeled by a two- 
step Bellman-Harris Branching Process that generalizes the static models known in the literature 
and incorporates the high variability of human behavior. It explains accurately all the features of 
information propagation under the "tipping-point" and can be used for prediction and management 
of viral information spreading processes. 



I. INTRODUCTION 

Each day, millions of conversations, emails, SMS, blog
posts and comments, instant messages, tweets or web
pages containing various types of information are
exchanged between people. Humans' natural inclination to
share information with others in a "viral" fashion stems
from the need to socialize and the wish to gain reputation,
influence, trustworthiness or popularity [1]. Such
viral dissemination of information through social networks,
commonly known as "Word-of-Mouth" (WOM),
is of paramount importance in our everyday life. In fact,
it is known to influence purchasing decisions to the extent
that 2/3 of the United States economy is driven by
this kind of personal recommendation [2]. WOM is
also important to understand sales and customer value
[3, 4], opinion formation or rumor spreading in social
networks [5, 6], or to determine the influence of each person
in their social neighborhood [7, 8]. Despite its importance,
and due to the difficulty (or inability) of capturing
this phenomenon, detailed empirical data on how humans
disseminate information are scarce [9], population-aggregated
[10] or indirect [11, 12]. Moreover, most studies
have concentrated on asymptotic stationary properties
of information diffusion [13-16]. This has hampered the
study of the dynamics of information diffusion, and indeed
most of its understanding comes from theoretical
propagation models running on empirical or synthetic social
networks in an approach borrowed from epidemiology
[17-19]. In those models, information diffusion equates
to the propagation of viruses or diseases that spontaneously
pass to others by contagion through the active social connections
of the infected (i.e. informed) agents.

However, information diffusion mechanisms are fundamentally
different from those operating in disease spreading.
In fact, passing a message along has a perceived
transmission cost, its targets are consciously selected
among potentially interested individuals [20, 21], it depends
on human volition and, ultimately, it is executed on the
individuals' activity schedule. An obvious implication
of those peculiarities is that information spreading is
bound to depend on the large variability observed both
in the volume and frequency of human activities and in
the perceived value/cost of transmitting the information.
For example, the number of emails sent by individuals
per day [22], the number of telephone calls placed by
users [23], the number of blog entries per user [24, 25],
the number of web page clicks per user [26], and the
number of a person's social relationships [27] or sexual
contacts [28] show large demographic stochasticity. In
fact, these numbers are distributed according to a power-law
(or Pareto) distribution, inconsistent with the mild
Gaussian or Poissonian stochasticity around population-averaged
values traditionally assumed in epidemiological
models [29]. The same large variability pattern applies
to the time dynamics of human activities: for example,
email response delays, market trading frequencies or
inter-event times of web page visits, telephone calls, etc.
are well described by power-law or log-normal distributions
[22, 30, 31]. Recent research has shown that such
high variability in human behavior substantially alters
the temporal dynamics of information diffusion and does
not merely introduce some stochasticity in population-averaged
models [9, 32, 33]. Thus, it is important to
incorporate this human behavior into the models.

Besides, information travels through social
connections, thereby depending on the properties of the
social networks where it spreads. For example, simulations
on synthetic scale-free networks showed that if
information flowed through every social connection the
epidemic threshold would be significantly lowered to the
extent that it could disappear [13, 34], so that any rumor,
virus or innovation might reach a large fraction of indi- 
viduals in the population no matter how small the prob- 
ability of being infected. Given the fact that social net- 
works are scale-free [35] those results predict that there 
is a strong interplay between network structure and the 
spreading process. However, such is not the case for
information spreading processes. Our daily experience indicates
that most rumors, innovations or marketing messages
do not reach a significant part of the population
[36]. As mentioned earlier, the perceived cost of transmitting
information prevents it from traveling inexpensively
through all possible network paths. Therefore, when participants
assess the value of the information being passed,
the impact of their social network structure on the diffusion
process might be diminished. Unfortunately, the
true extent of such influence remains unknown in general.
Moreover, the reach of information can be affected
by the dynamics of human communication [33] and thus
it is important to understand the interplay between the 
static and dynamical properties of information diffusion. 

Finally, there is an important shortcoming in the data 
currently available to investigate those questions. The 
vast majority of the large amount of data collected on 
information exchanges, for example email, SMS, calls or 
tweets, lacks the details required to follow the dynamics
of a specific content item at the individual's level
(see however [37]). Thus, the behavioral stochasticity of
the individuals caused by the message content is masked 
and observations are limited to people's stochasticity due 
to the transmission media. A representative example of 
this difficulty is the study of communication patterns in 
mobile phone calls [32, 33, 38], in which every communication,
regardless of the message, is used to partially
discover the network of social relationships through which
potential messages will spread but is not capable of revealing
the specific dynamics of a particular piece of information.
In other cases, data is not available at the
individual participant level but only as population-averaged
metrics [20, 36], thereby hiding that different content
items elicit diverse task prioritization in a given person
or social segment. The situation is clearly unsatisfactory
since, to our knowledge and possibly because of privacy 
concerns or data proprietorship, there are not very many 
data sets tracing the propagation of a specific piece of 
content throughout the social network (see however [15]).

To overcome those limitations in the understanding of 
electronic information diffusion, we present here the re- 
sults of a series of controlled Viral Marketing campaigns, 
the commercial form of WOM [39] , that we conducted 
in eleven European countries. In these campaigns, subscribers of
a business online newsletter received incentives for recommending
the newsletter subscription to their acquaintances.
The detailed tracking of those recommendations
revealed the factors impacting the diffusion dynamics of
that particular piece of information at every step and suggested
a branching process as the mechanism driving the
dynamics of information diffusion. Thus, the Bellman-Harris
Branching Model, a generalization of the static
percolation model introduced by Newman [13] for contagion
propagation in networks, accurately describes our
Viral Marketing campaigns. In particular, this branching
model explains the diffusion of information in
random networks and constitutes the simplest approach
incorporating the high variability patterns of human behavior
both in activity volume and in response time.

The rest of this paper is organized as follows: Section II
introduces our Viral Marketing campaigns and
the viral diffusion mechanism used in them,
while Subsections II A and II B, respectively, present the
data set of the campaigns' propagation results and analyze the
diffusion dynamics patterns and social connectivity
observed in such propagation. Section III follows
with the analytical formulation of the Bellman-Harris
Branching Model, which includes a detailed discussion of
its phase transitions, asymptotic properties and time dynamics,
while Section IV studies examples of its
application to several scenarios of the response time distribution
in the information propagation. We present our
conclusions in Section V. Finally, Appendix A discusses
aspects of the substrate social network structure that can
be gleaned through the information propagation process.



II. VIRAL CAMPAIGNS DESCRIPTION 

We tracked and measured the "Word-of-Mouth" diffusion
of viral marketing campaigns run in eleven European
markets that invited subscribers of an IT company
online newsletter to promote new subscriptions among 
friends and colleagues. Campaign participants received 
incentives for spreading the offering through recommen- 
dation emails. The campaigns were fully web based. 
Banner ads, emails, search engines and the company 
web page drove participants to the campaign offering 
site. There, participants could fill in a referral form with 
names and email addresses of those to whom they recommended
subscribing to the newsletter. The submission
of this form launched recommendation emails including 
a link to the campaign main page whose automatically 
generated URL was appended with codes allowing the 
web server to uniquely assign clicks on it to the sender 
and receiver of the corresponding email (see footnote 1). The form,
allowing up to four referrals per submission, checked destination
email addresses for syntax correctness and to
avoid self-recommendations. Cookies prevented multiple
recommendations to the same address and improved



1 Clicks on referral emails forwarded to a third person could not 
trace that individual and were assigned to their original receiver. 
FIG. 1. The viral messages diffusion graph of our campaigns
is a set of 7,188 disconnected cascades like this one observed
in Spain. Its 122 nodes (represented by dots) are grouped in 8 
generations (horizontal layers) that stem from the generation 
zero node at the top (Seed node, black) and grow through a 
branching process driven by the active nodes (gray) in each 
generation. Its tree-like structure is devoid of closed paths or 
triangles for a clustering coefficient C = 0. 



usability by automatically filling-in sender's data in sub- 
sequent visits to the submission form. Additionally, the 
campaign server logged the time stamp of each step of the 
process (subscription, recommendation submission) and 
removed from records undeliverable recommendations. 

The incentive for potential participants was the possibility
of winning a laptop computer in a lottery taking
place at the end of the campaign. The goal of such an incentive
was threefold: firstly, increasing participation; secondly,
discouraging indiscriminate referrals, which could
lead to spamming-like behavior; and, lastly, ensuring legal
backing for tracking sender-receiver pairs as required
by the campaign sponsor's privacy policy. To reach those
goals, eligibility to participate in the lottery was limited to
the so-called "successful emails", defined as any recommendation
email whose recipient clicked on the coded
URL included in it. Thus, the more referral emails sent
to recipients who opened them and clicked their link,
the bigger the sender's winning odds. The lottery draw
was held among successful recommendations only and 
both sender and receiver of the winning recommendation 
would receive the prize. The campaign terms and conditions,
accessible from all web pages, stated that participation
in the prize draw implied the sender's and receiver's
authorization for the system to record the details
of their email transaction, since this was necessary to
ensure that both parties could receive the prize if their
email was a winner. Subscribing to the newsletter was
not required to take part in the prize draw. Campaigns
in all countries ran in the local language but were otherwise
identical: same offering, incentive, eligibility rules,
prize draw mechanism, campaign period, web user interface
and tracking processes. This ensured equivalence of
the experiment in all countries and allowed tracing differences
in observed behavior to the market specifics and
not to the campaigns' execution. In addition, this guaranteed
the neutrality of the message content with regard
to the recipients' reaction. Unsuccessful emails, disconnected
nodes, nodes with invalid or undeliverable email
addresses, self-recommendations and multiple referrals
between the same nodes were discarded. The viral
propagation network of the message was built from this cleansed
data set and its key parameters were measured with standard
network analysis tools. Personal information was encrypted
to protect the participants' privacy.



A. Campaigns propagation data set 

Spurred by the sponsor's web sites, email marketing and
exogenous online advertising, a total of 7,225 individuals
acted as Seed nodes by initiating message diffusion cascades,
which subsequently grew through viral pass-along
driven by 2,002 secondary spreaders, which we will also
designate as Viral nodes in what follows. The viral
offering touched another 21,956 individuals who did
not forward it and were, therefore, passive nodes. All
in all, and as shown in Table I, a total of N = 31,183 individuals,
of which 9,227 were active spreaders, received
the viral message. Thus, 77% of the campaigns' participants
received the message through the endogenous viral
propagation mechanism. The 7,188 tree-like, independent
propagation cascades originated by this process, such
as the one in Fig. 1, form the Cascades Network, a sparse
graph whose nodes, representing campaign participants,
are connected by 24,207 directed links formed by the
recommendation emails they sent. Besides, the viral cascades
are generally almost pure trees, with very few loops
or closed triangles, as evidenced by the Clustering Coefficient
of the network of all markets, $C_{cas} = 0.0048$, which
is two orders of magnitude lower than typical values reported
for social networks [40]: for example, $C_{eml} = 0.156$
measured in a typical email network of similar size [41].

By analogy to the spreading of diseases [29], diffusion
of information in a population is often described by average
quantities. Although receiving and propagating messages
can be quite involved processes, population-level
analysis describes information propagation as a function
of the probability $\lambda_1$ of a person becoming a secondary
spreader after receiving a message from a Seed node and
of the average number of people $\bar{r}_1$ contacted by such
secondary spreaders. In this simple approach those two
parameters, Transmissibility ($\lambda_1$) and Fanout coefficient
($\bar{r}_1$), fully characterize the mean-field description. In our
Market       $N$     $n_s$  $\bar{s}$  $s_{max}$  $C_{cas}$  $\bar{r}_0$  $\bar{r}_1$  $(\bar{r}_1)_{SEM}$  $\lambda_1$  $R_1$
France       11,758  3,248  3.62       139        0.0000     2.21         2.50         0.1023               0.062        0.154
DE+AT        7,943   1,750  4.54       146        0.0049     2.48         3.06         0.1155               0.092        0.281
Spain        5,260   843    6.24       122        0.0054     3.16         3.45         0.1909               0.115        0.397
Nordic       2,509   524    4.79       34         0.0077     2.82         2.91         0.1836               0.089        0.259
UK+NL        2,111   518    4.08       25         0.0112     2.49         2.87         0.2398               0.067        0.192
Italy        1,602   319    5.02       41         0.0234     2.87         2.80         0.2301               0.084        0.236
All markets  31,183  7,188  4.34       146        0.0048     2.51         2.96         0.065                0.083        0.246

TABLE I. Structural and dynamic parameters of the viral diffusion network by market: number of nodes ($N$) and of viral
cascades ($n_s$), average cascade size ($\bar{s} = N/n_s$), largest cascade size ($s_{max}$) and Clustering coefficient of the Cascades Network
($C_{cas}$). The diffusion dynamic parameters are the average number of recommendations sent by Seed nodes ($\bar{r}_0$) or by Viral
nodes ($\bar{r}_1$, a.k.a. Fanout coefficient) and the Transmissibility $\lambda_1$. Also shown are the Standard Error of the Mean of the Fanout
coefficient, $(\bar{r}_1)_{SEM}$, and the Basic Reproductive Number $R_1$. Nordic comprises DK, FI, NO and SE.



campaigns only 8.36% of the participants receiving a recommendation
email from another participant engaged in
spreading it themselves, and thus $\lambda_1 = 0.0836$. Those
secondary spreaders sent, on average, 2.96 messages each,
and hence the Fanout coefficient was $\bar{r}_1 = 2.96$. Interestingly,
this value is higher than the average number of
recommendations ($\bar{r}_0 = 2.51$) sent by the Seed nodes that
triggered cascades after becoming aware of the campaign
message through market seeding tactics. This gap stems
from the combination of two factors: firstly, a stronger
involvement in the diffusion by the individuals receiving
the message from a trusted source versus those who found
the campaign by chance [42] and, secondly, the "Friendship
paradox" [43], a property of networks which causes
individuals reached through messages sent by others to
be more connected on average than those chosen at random:
for example, in a random network with node degree
distribution $P(k)$ and average degree $\bar{k}$, the probability of
randomly picking a node of degree $k'$ is $P(k')$, whereas the
probability of a message coming from any node reaching a node of degree
$k'$ is $k'P(k')/\bar{k}$, bigger than $P(k')$ for $k' > \bar{k}$ [35]. Thus
highly connected nodes are more likely to be reached by
messages already spreading through the network than
by exogenous marketing tactics (web banners or email
tactics), which do not benefit from such a network effect
when the message spreading starts. This phenomenon
causes secondary spreaders to have more contacts on average
than Seed nodes and more choices to forward the
message.
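
The degree bias behind the "Friendship paradox" can be illustrated with a minimal sketch. The toy degree distribution below is an assumption chosen only for illustration (it is not campaign data), and the function names are hypothetical; exact fractions are used so that the reweighting $k'P(k')/\bar{k}$ can be checked by hand.

```python
from fractions import Fraction


def mean_degree(pk):
    """Average degree under a degree distribution given as {k: P(k)}."""
    return sum(k * p for k, p in pk.items())


def reached_degree_distribution(pk):
    """Distribution k P(k) / <k> of the degree of a node reached by following a link."""
    kbar = mean_degree(pk)
    return {k: k * p / kbar for k, p in pk.items()}


# Toy example (illustrative assumption): half the nodes have 1 contact, half have 3.
pk = {1: Fraction(1, 2), 3: Fraction(1, 2)}
qk = reached_degree_distribution(pk)

print("random node:          ", mean_degree(pk))   # <k>
print("node reached via link:", mean_degree(qk))   # <k^2> / <k>
```

In this toy case a node picked at random has 2 contacts on average, whereas a node reached through a link has 5/2, since nodes with more contacts are proportionally easier to reach.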

On the other hand, it makes sense to assume that the
number of recommendations sent by secondary spreaders
(including not sending any) results from a decision by
each message recipient that involves a trade-off between
the message forwarding cost and its perceived value. For
our campaigns' lottery prize, for example, and in a population-average
approach, a reasonable proxy of the perceived
value of winning the prize for residents of a given
country could be the fraction of the average income of
its citizens represented by the prize cost in that market.
Granted, there may be many other factors at play in the
formation of such a perception, but there is a very significant
correlation ($\rho = 0.6$) between the average income



Node class  $N$    $\bar{r}$  $\overline{r^2}$  $\sigma_r$  $(\bar{r})_{SEM}$  $\alpha$  $\beta$
Seed (0)    7,225  2.51       15.14             2.97        0.035              3.50      30.52
Viral (1)   2,002  2.96       18.10             3.05        0.068              3.71      100.88
Active (a)  9,227  2.61       15.82             3.00        0.031              3.54      39.48

TABLE II. Statistics of the recommendation activity ($r$) of the
viral campaigns' participants, by node class. Active nodes (a) are
the union of the Seed (0) and Viral (1) classes. The probability
distribution of the number of recommendations ($r$) fits
a Harris power-law of the form of Eq. (2), with $\alpha$ and $\beta$ estimated
by the method of moments using $\bar{r}$ and $\overline{r^2}$.



and the average number of recommendations $\bar{r}_1$ sent by
secondary spreaders in each market, which indicates that
the average relative size of the expected gain may be one of
them (see Table I).
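
As a consistency check on Table II, the sketch below recomputes the standard deviation and the standard error of the mean from the tabulated sample size, mean and second moment. It assumes the table columns read, in order, $N$, $\bar{r}$, $\overline{r^2}$, $\sigma_r$ and $(\bar{r})_{SEM}$; the helper name `spread_stats` is hypothetical.

```python
import math

# (N, mean r, second moment of r) per node class, copied from Table II.
TABLE_II = {
    "Seed":   (7225, 2.51, 15.14),
    "Viral":  (2002, 2.96, 18.10),
    "Active": (9227, 2.61, 15.82),
}


def spread_stats(n, mean, second_moment):
    """Return (standard deviation, standard error of the mean)."""
    sigma = math.sqrt(second_moment - mean ** 2)
    return sigma, sigma / math.sqrt(n)


for name, row in TABLE_II.items():
    sigma, sem = spread_stats(*row)
    print(f"{name:6s} sigma_r = {sigma:.3f}   SEM = {sem:.4f}")
```

Because the inputs are themselves rounded to two decimals, the recomputed values should agree with the tabulated ones only to within rounding.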

Additionally, the human intervention in such a decision
process is at the root of a unique property of the
dynamics of information diffusion. Comparing the viral campaigns'
parameters in different markets (see Table I), we
observe a wide range of values in their respective information
propagation dynamical parameters. Since the campaigns'
execution was identical in all markets, those variations
can only be due to a change in the perception of the
viral offering's value and/or of the message forwarding cost
by customers in each market. Interestingly, variations of
the Transmissibility ($\lambda_1$) and the Fanout coefficient ($\bar{r}_1$)
present a Pearson coefficient $\rho = 0.92$, evidence of a
very strong dependence between them. We proved in
[44] that such dependence has the form

$$\bar{r}_1 = 1 + b\,(1 - e^{-c\lambda_1}), \qquad 0 < \lambda_1 < 1 \qquad (1)$$

which reduces to $\bar{r}_1 \simeq 1 + a\lambda_1$ ($a = bc$) for $c\lambda_1 \ll 1$.
This peculiarity of information diffusion processes, not
observed in disease epidemics, arises because the decisions
of becoming a spreader and of the number of viral
messages to send are made simultaneously by each participant,
which introduces correlation in their averages.
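
The correlation between Transmissibility and Fanout coefficient can be recomputed from the six per-market rows of Table I. The sketch below is only a check under the assumption that the Pearson coefficient is taken over those six rows; it also verifies that the tabulated Basic Reproductive Number matches the product $\lambda_1 \bar{r}_1$ up to rounding. The function name `pearson` is introduced here for illustration.

```python
import math

# (lambda_1, mean fanout r1_bar, R_1) per market, copied from Table I.
MARKETS = {
    "France": (0.062, 2.50, 0.154),
    "DE+AT":  (0.092, 3.06, 0.281),
    "Spain":  (0.115, 3.45, 0.397),
    "Nordic": (0.089, 2.91, 0.259),
    "UK+NL":  (0.067, 2.87, 0.192),
    "Italy":  (0.084, 2.80, 0.236),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


lambdas = [row[0] for row in MARKETS.values()]
fanouts = [row[1] for row in MARKETS.values()]
rho = pearson(lambdas, fanouts)
print(f"Pearson coefficient between lambda_1 and r1_bar: {rho:.2f}")

for market, (lam, fanout, r1) in MARKETS.items():
    # R_1 should equal lambda_1 * r1_bar up to the rounding of the table.
    print(f"{market:7s} lambda_1 * r1_bar = {lam * fanout:.3f}   tabulated R_1 = {r1:.3f}")
```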
FIG. 2. (Color online) Cumulative probability distribution of the
number of recommendations sent by Active nodes (Seed + Viral)
for campaigns in all markets (circles).
Solid line is the fit to a power-law $P(r) = H_{\alpha\beta}/(\beta + r^{\alpha})$ whose
pdf exponent is $\alpha = 3.54 \pm 0.02$ (see Table II). Dashed line
is the prediction of a discrete Poisson distribution with the
mean of the empirical data ($\bar{r}_a = 2.61$).



FIG. 3. (Color online) Cumulative distribution of viral cascade
sizes in all markets (circles). The power-law line underneath
the circles is not a fit of the data but the prediction
of the Bellman-Harris branching model with a power-law pdf
for the recommendations distribution. The line below it is the
branching model prediction with a Poisson distribution (see
Sec. III).



B. Diffusion dynamics analysis 

In a first approximation we could analyze information
dynamics by studying the Basic Reproductive Number
$R_1$ of epidemiology, the average number of secondary
cases generated by each virally informed individual,
which results from the definition of the dynamical
parameters as $R_1 = \lambda_1 \bar{r}_1$. However, average quantities
like $R_1$ hide the heterogeneous nature of epidemics [45]
and also of information diffusion. In fact, our campaigns
show that most of the observed transmission occurs due
to extraordinary events. In particular, we find that the
probability distribution function (pdf) of the number of
recommendations sent is well approximated by the Harris
discrete distribution

$$p_r = \frac{H_{\alpha\beta}}{\beta + r^{\alpha}}, \qquad r = 1, 2, \ldots \qquad (2)$$

where $H_{\alpha\beta}$ is a normalization constant so that
$\sum_{r=1}^{\infty} p_r = 1$. This function displays a power-law behavior
$p_r \sim r^{-\alpha}$ in its tail, starting approximately at
the cutoff point $r^* \simeq \beta^{1/\alpha}$. Table II lists the distribution
parameters for Seed ($p_{0,r}$), Viral ($p_{1,r}$) and total
Active ($p_{a,r}$) nodes, while Fig. 2 shows the probability
distribution of the recommendations sent by Active
nodes in all markets and its comparison to the probability
predicted by a Poisson discrete distribution with
mean $\bar{r}_a = 2.61$, the same as that of the empirical data. The
markedly different behavior of the two indicates
the high probability of finding individuals making a
large number of recommendations. As noted in the introduction,
such high demographic stochasticity, observed in
many other human activities [22-28], suggests that humans'
response to a particular task cannot be described
by close-to-average models where they are all assumed
to behave in a similar fashion with some small degree of
demographic stochasticity [46]. In sharp contrast with
population-homogeneous models of information spreading,
we found that 2% of the active population in our
viral campaigns has $r_a > 10$, suggesting the existence of
super-spreading individuals.
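
The contrast between the fat-tailed Harris distribution and a Poisson distribution of equal mean can be made concrete with a short numerical sketch. It assumes the Harris form $p_r \propto 1/(\beta + r^{\alpha})$ on $r = 1, 2, \ldots$ with the Active-node parameters of Table II ($\alpha = 3.54$, $\beta = 39.48$); the truncation point `r_max` and the function names are choices made here for illustration, not part of the original analysis.

```python
import math


def harris_pmf(alpha, beta, r_max=100_000):
    """Truncated Harris pmf p_r = H / (beta + r**alpha) for r = 1..r_max."""
    weights = [1.0 / (beta + r ** alpha) for r in range(1, r_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def poisson_tail(mean, r_min):
    """P(X >= r_min) for a Poisson variable with the given mean."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(r_min))
    return 1.0 - below


pmf = harris_pmf(alpha=3.54, beta=39.48)                 # Active nodes, Table II
harris_mean = sum(r * p for r, p in enumerate(pmf, start=1))
harris_tail = sum(pmf[10:])                              # P(r > 10): r = 11, 12, ...
poisson = poisson_tail(2.61, 11)                         # same event, Poisson with mean 2.61

print(f"Harris mean:           {harris_mean:.2f}")
print(f"P(r > 10), Harris:     {harris_tail:.2e}")
print(f"P(r > 10), Poisson:    {poisson:.2e}")
```

Under these assumptions the fitted Harris distribution reproduces a mean close to the empirical 2.61 and assigns a probability at the percent level to sending more than ten recommendations, while the Poisson distribution with the same mean makes that event orders of magnitude rarer.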

Super-spreading individuals have also been found in
non-sexual disease spreading [45], where they significantly
increase outbreak sizes. In a similar manner, the sizes of
the information cascades found in our campaigns indicate
that super-spreading individuals are responsible for making
large viral cascades rarer but more explosive. The
probability distribution of the campaigns' cascade sizes,
represented in Fig. 3, is also a fat-tailed distribution
(in fact, the tail can be fitted to a power law $p_s \sim s^{-\beta}$
with $\beta \simeq 3.2$). In contrast, neglecting the existence of
super-spreading individuals but still considering some degree
of stochasticity in the number of recommendations
by assuming $p_{a,r}$ is a Poisson distribution with the same
average, a cascade like the one in Fig. 1 would have an
occurrence probability of approximately once every $10^{12}$
Seed nodes, a number much larger than the total world
population (see Fig. 3).

An element to consider in the aforementioned spreading
stochasticity is the impact, if any, of the underlying
social network heterogeneity, in a similar way to that of
the connectivity of a computer network on the diffusion of
computer viruses [47]. Social network data reveal that
humans show large variability in their number of social
contacts [48]. Thus, the connectivity $k_i$ of email networks,
whether measured by email traffic or by the users'
email address books, is fat-tailed distributed [41, 47]. In
some cases it is power-law distributed, like the number
of recommendations in our campaigns. Large variability
in the number of social contacts has a deep effect
on disease spreading [13, 34]. In fact, disease spreading
models on networks show that if information flows with
the same probability through any link in a social network,
its topological properties can significantly lower the
"tipping-point" (see footnote 2). However, while indiscriminate propagation
can happen in computer viruses, diseases or other
mechanistic processes, the human handling of information
diffusion limits the influence of the social network
structure: we expect, in general, the number of recommendations
to be small compared to the social connectivity
($r_i \ll k_i$). While in social networks the "Friendship
paradox" [35, 43] implies that $k_{nn} \gg \bar{k}$ (with $k_{nn}$
the average number of social contacts of an individual's
neighbors and $\bar{k}$ the average number of social contacts
of an individual), our recommendation network features
$r_{nn} = \bar{r}_v \simeq \bar{r}_s$. If, as supposed in most models [17, 34],
information flows through a fraction of the social contacts
of an individual, we should have $\bar{r}_{nn} \gg \bar{r}$ instead.
A way to recover our result is to assume that $r_i$ and $k_i$ are
largely independent. Our tree-like diffusion cascades lead
to a low undirected clustering coefficient [35] of the viral
cascades network ($C_{cas} = 0.0048$) compared to the values
reported for email social networks ($C_{soc} \simeq 0.15$ to $0.25$)
[47], which supports such an assumption. Assuming $r_i$ and
$k_i$ independent, we get (App. A)

$$C_{cas} \simeq \frac{2 R_1}{\langle k_{nn} \rangle - 1} \times C_{soc}, \qquad R_1 < 1 \qquad (3)$$

where $\langle k_{nn} \rangle$ is the average number of social contacts of the
neighbors of an individual. In social networks $\langle k_{nn} \rangle$ is
a large number, which leads to a very low clustering coefficient
even for processes close to the "tipping-point"
($R_1 \simeq 1$). This fact explains the unreasonable effectiveness
of tree-based theory to explain information diffusion
on networks with clustering [49]. In conclusion, the large
heterogeneity of recommendation activity is due to the participants'
behavior rather than a consequence of their connectivity
degree, which is just the upper bound of the activity.

Finally, another important aspect to consider in the
dynamics of information diffusion is the nodes' reaction
to receiving a message: should they decide to spread it,
how long do they take to do so? For how long do they
remain active? And is their responsiveness correlated in
any way with the number of contacts they resend the message
to? The answer to these questions lies in the increasing
evidence that the timing of many human activities,
ranging from communication to entertainment and work
patterns, follows non-Poisson statistics, characterized by
bursts of rapidly occurring events separated by long periods
of inactivity [22]. In fact, our campaigns revealed
that most of the active nodes turn inactive right after



2 Epidemiology term designating the point in a contagion process 
where its spreading rate increases dramatically and changes the 
nature of the process. 




FIG. 4. (Color online) Relationship between the time (days)
a Viral node remains active (infected), $\tau_{infected}$, and the time
elapsed until it resends its first message ($\tau_{response}$) for each
Viral node in our campaigns. The line is $\tau_{infected} = \tau_{response}$.
Nodes on the line sent the message to all contacts at once,
while those outside it remained spreaders for a longer period.
Only early responders ($\tau_{response} < 10$ days) have some likelihood
of staying active for more than one forwarding session.



spreading the information once, which means that Viral
nodes do not remain spreaders for a long time. The
top panel in Fig. 4 shows that for most of the Viral nodes
(actually 97% of them), the lapse of time between receiving
the message and passing it along, $\tau_{response}$, equals the
interval between receiving the message and the last time
it was resent, $\tau_{infected}$. The fact that, for the most
part, Viral nodes show just one spreading event means,
from a modeling perspective, that diffusion follows an almost
pure "birth and death" model. Besides, the time
dynamics of the viral recommendation process is independent
of the number of recommendations $r_1$ sent
by Viral nodes, as was shown in [9]; that is, there is no
correlation between such number and the response time
$\tau_{response}$, as evidenced by the Pearson correlation coefficient
of the two variables ($\rho = -0.05$). As we have
shown in [9], the probability distribution function of the
Viral nodes' response time, $P(\tau_{response})$, is a long-tailed
log-normal, further evidence of humans' large heterogeneity
in WOM diffusion. In this sense, participants
behave as in an SIR model in which infection and decay to
the recovered state happen at the same time [29].



III. BRANCHING DYNAMICS MODEL 

The study of our experimental data leads to a theo- 
retical framework for the process of information diffusion 
where the dynamics of information viral spreading is ex- 
plained by tree-like cascades. Each information cascade 
stems from an initial Seed that starts the viral message 
propagation with a random number of recommendations 



FIG. 5. Flowchart example of cascades generated by the Bellman-Harris branching model used to explain diffusion of information
in social networks: the cascade starts with a Seed (labeled 0) which sends the information to $r = 3$ of its social contacts after
time $\tau_0$. Viral nodes 1 and 2 are "infected" and forward the message to $r = 3$ and $r = 2$ social contacts after times $\tau_1$ and
$\tau_2$ respectively, while uninterested node 3 remains inactive. Values of $r$ and $\tau$ are independent and sampled from distributions
$P(r)$ and $G(\tau)$. Propagation continues until there are no active nodes left. Time increases left to right.



distributed as po,r and whose average is fo- The individ- 
uals reached by the message become secondary spread- 
ers with probability Ai thereby giving birth to a new 
generation of Viral nodes which, in turn, propagate the 
message further with n recommendations distributed by 
Pi y r with average f\. After sending their recommenda- 
tions individuals become inactive and the process contin- 
ues stochastically through new individuals in successive 
generations until none of the members of the latest one 
spread the message. At that point the information cas- 
cades die out and the propagation ends. This process cor- 
responds to the well known Bellman-Harris (BH) branch- 
ing model [9j [50j [51] which is the simplest mathematical 
framework to study the branching dynamics of informa- 
tion diffusion. It generalizes the static and Markovian 
Galton- Watson model typically used to model informa- 
tion diffusion [9j [14] [I5j [52] or, in general, percolation 
processes in social networks [13] . 

In the BH model, the two distributions $p_{0,r}$ and $p_{1,r}$ ($r = 1, 2, \ldots$) represent the number of recommendations sent by Seed and Viral nodes respectively. Two different distributions are introduced not only because of the difference in the average number of recommendations observed in our campaigns (see Table I) but also because, in general, in social networks the average connectivity of a node's nearest neighbors is higher than the average connectivity of the network nodes themselves. In particular, for completely uncorrelated random networks with distribution of connectivity given by $P(k)$, the distribution of the number of connections of the nearest neighbors of a node is $P'(k) = kP(k)/\bar k$ [53]. The case in which informed nodes decide not to pass along the information can be incorporated in the recommendations distribution as the case in which the number of messages sent is $r_i = 0$. Thus we can construct a family of probability distributions $\tilde p_{i,r}$ ($i = 0, 1$) of the recommendations sent by nodes, where

$$\tilde p_{i,0} = 1 - \lambda_i, \qquad \tilde p_{i,r} = \lambda_i\, p_{i,r} \quad (r > 0), \tag{4}$$

from which one can obtain the average number of recommendations in the new distributions, which are related to the primary and secondary reproductive numbers as

$$\sum_{r>0} \tilde p_{0,r}\, r = \lambda_0 \bar r_0 = R_0, \tag{5}$$

$$\sum_{r>0} \tilde p_{1,r}\, r = \lambda_1 \bar r_1 = R_1. \tag{6}$$

To formalize the study of the information spreading branching process, we now define the generating functions

$$f_0(x) = \sum_{r=0}^{\infty} \tilde p_{0,r}\, x^r, \qquad f_1(x) = \sum_{r=0}^{\infty} \tilde p_{1,r}\, x^r. \tag{7}$$

Moments of the distributions can be obtained through derivatives of the generating functions,

$$R_0 = f_0'(1), \qquad \sigma_0^2 = f_0''(1) + f_0'(1) - [f_0'(1)]^2, \tag{8}$$

where $\sigma_0^2$ is the variance of the number of recommendations of Seed nodes. We will also assume different cdfs of response times ($\tau_{\mathrm{infected}}$) for Seed and Viral nodes, which we will denote as $G_0(t)$ and $G_1(t)$. Their means are $\bar\tau_0$ and $\bar\tau_1$ respectively.
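The construction in Eqs. (4)-(8) can be checked on a small example. The sketch below is only illustrative: the uniform distribution on 1..5 and the transmissibility 1/10 are hypothetical values, not campaign data, and the function names are ours. It builds the thinned distribution of Eq. (4) and recovers the reproductive number and the variance from the derivatives of the generating function at $x = 1$, using exact fractions.

```python
from fractions import Fraction

def thinned_distribution(lam, p):
    """Return the distribution of Eq. (4): mass 1 - lam at r = 0, lam * p[r] for r > 0.

    `p` maps r >= 1 to the probability that an active node sends r messages.
    """
    tilde = {0: 1 - lam}
    for r, prob in p.items():
        tilde[r] = lam * prob
    return tilde

def pgf_derivatives_at_one(tilde):
    """Return f(1), f'(1), f''(1) for f(x) = sum_r tilde[r] * x**r (Eq. 7)."""
    f = sum(tilde.values())
    f1 = sum(r * q for r, q in tilde.items())
    f2 = sum(r * (r - 1) * q for r, q in tilde.items())
    return f, f1, f2

# Hypothetical Viral-node example: 1..5 messages with equal probability.
lam1 = Fraction(1, 10)
p1 = {r: Fraction(1, 5) for r in range(1, 6)}
tilde_p1 = thinned_distribution(lam1, p1)

norm, R1, f2 = pgf_derivatives_at_one(tilde_p1)
mean_r1 = sum(r * q for r, q in p1.items())   # average of the unthinned distribution
variance1 = f2 + R1 - R1**2                   # same combination as in Eq. (8)
```

With these assumed values the thinned distribution stays normalized and $R_1$ equals the product of transmissibility and mean fanout, as Eq. (6) states.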

We want to determine the probability distribution of finding $I(t)$ nodes active (i.e. recommending) at time $t$ provided we start with one participant at $t = 0$, i.e. $I(0) = 1$. To do that we use the following self-consistent argument: since the numbers of recommendations sent by each Viral node are independent random variables, the branching processes starting from each Viral node after a given recommendation, which we denote $I_1(t)$ (with $I_1(0) = 1$), are independent identically distributed (iid) copies of the same process. For example, in Fig. 5 the branching processes starting from nodes 1 and 2 are iid copies of the same process $I_1(t)$. But also, the $I_1(t)$ process starting from 1 and the $I_1(t)$ processes starting from 4 and 5 must be statistically the same. Thus we have a self-consistent relationship between the branching process starting at a Viral node and the processes starting from any of its $r_1$ recommendations:

$$I_1(t) = \begin{cases} 1 & \text{if } t < \tau \\ \sum_{j=1}^{r_1} I_1^{(j)}(t - \tau) & \text{if } t \geq \tau \end{cases} \tag{9}$$



where $I_1^{(j)}(t)$ are iid copies of the branching process $I_1(t)$ and assuming that the recommendation event happens at $t = \tau$. Note that in this self-consistent equation $r_1$ (the number of recommendations made by a Viral node and distributed by $p_{1,r}$) and the time $\tau$ are both random and independent. To describe the process we use generating function techniques: we define the generating function for $I_1(t)$ as $F_1(s,t) = \sum_{k \geq 0} P[I_1(t) = k]\, s^k$, and thus we get

$$F_1(s,t) = \begin{cases} s & \text{if } t < \tau \\ f_1[F_1(s, t - \tau)] & \text{if } t \geq \tau \end{cases} \tag{10}$$

Finally, since $\tau$ occurs randomly with cdf $G_1(t)$, one can integrate equation (10) over $\tau$ to get

$$F_1(s,t) = s[1 - G_1(t)] + \int_0^t dG_1(\tau)\, f_1[F_1(s, t - \tau)]. \tag{11}$$



The same reasoning can be applied to the Seed nodes, with the exception that now the number of recommendations is distributed according to $p_{0,r}$. Denoting by $I_0(t)$ the process starting from an initial Seed, we have

$$I_0(t) = \begin{cases} 1 & \text{if } t < \tau \\ \sum_{j=1}^{r_0} I_1^{(j)}(t - \tau) & \text{if } t \geq \tau \end{cases} \tag{12}$$

where once again $I_1^{(j)}(t)$ are iid copies of the branching process $I_1(t)$ and $r_0$ is a random number with probability distribution $p_{0,r}$. The same reasoning as above leads to

$$F_0(s,t) = s[1 - G_0(t)] + \int_0^t dG_0(\tau)\, f_0[F_1(s, t - \tau)]. \tag{13}$$
Jo 

This equation describes the time dynamics of our branching process starting from a given Seed. Note that it is a non-homogeneous equation, since it depends on the solution of Eq. (11). Thus we must first solve Eq. (11) and then insert its solution in Eq. (13).



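Identical reasoning can be used to derive the equations for $S_0(t)$ [$S_1(t)$], the size of a cascade at time $t$ starting from a Seed [Viral] node at $t = 0$, to obtain

$$S_0(t) = \begin{cases} 1 & \text{if } t < \tau \\ 1 + \sum_{j=1}^{r_0} S_1^{(j)}(t - \tau) & \text{if } t \geq \tau \end{cases} \tag{14}$$

where

$$S_1(t) = \begin{cases} 1 & \text{if } t < \tau \\ 1 + \sum_{j=1}^{r_1} S_1^{(j)}(t - \tau) & \text{if } t \geq \tau \end{cases} \tag{15}$$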



Thus, the generating functions for the cascade sizes,

$$\Phi_0(s,t) = \sum_{k=1}^{\infty} P[S_0(t) = k]\, s^k, \tag{16}$$

$$\Phi_1(s,t) = \sum_{k=1}^{\infty} P[S_1(t) = k]\, s^k, \tag{17}$$

are the solutions of the integral equations

$$\Phi_0(s,t) = s[1 - G_0(t)] + s\int_0^t dG_0(\tau)\, f_0[\Phi_1(s, t - \tau)], \tag{18}$$

$$\Phi_1(s,t) = s[1 - G_1(t)] + s\int_0^t dG_1(\tau)\, f_1[\Phi_1(s, t - \tau)]. \tag{19}$$

Note that these equations generalize the static ones introduced by Newman [13] and include the example of epidemics in configuration model networks in [54]. General solutions for equations (11), (13), (18) and (19) are not known, but some special cases and limits can be studied. In the following subsections, we study some properties of the model and compare its predictions with our experiments and other theoretical situations.



A. The "tipping-point" 

We are interested in the dynamical process when time is large enough, but also in the asymptotic regime when $t \to \infty$. In particular, the overall probability $q_0$ of extinction of the cascade combines the probability that the initial Seed does not propagate the information, $(1 - \lambda_0)$, with the probability that, even when the Seed propagates the information to some nodes, the branches stemming from the eventual Viral nodes die out. The extinction probability $q_1$ of a branch started by a Viral node, i.e. the probability that $I_1(t) = 0$ (no new nodes in the branch) at some finite time $t$, results from the generating function as

$$q_1 = \lim_{t\to\infty} P[I_1(t) = 0] = \lim_{t\to\infty} F_1(0,t) = F_1(0,\infty). \tag{20}$$



Inserting this definition in equation (11) we get that $q_1$ is the root of

$$q_1 = f_1(q_1). \tag{21}$$

FIG. 6. (Color online) Size of viral cascades as a function of $\lambda_1$ for the markets in Table I. Triangles represent the cascade size multiplier $1/(1 - R_1)$ (left Y axis). The dashed line is not a fit but the prediction of the branching model (Eq. 27), which diverges at the "tipping-point" ($\bar r_1 \simeq 5.27$, $(\lambda_1)_c \simeq 0.19$) estimated from the correlation $\bar r_1 = 1 + 22.48\lambda_1$ existing between Transmissibility and Fanout coefficient (solid line, circles, right Y axis).



Since generating functions are convex and $f_1(1) = 1$, we get that if $R_1 = f_1'(1) < 1$ the only solution is $q_1 = 1$, while if $R_1 > 1$ there exists a solution $0 < q_1 < 1$. The point $R_1 = 1$ is known as the "tipping-point", since above it there is a finite probability $1 - q_1$ that a viral cascade does not die out and thus grows indefinitely, while below the "tipping-point" $q_1 = 1$ and thus every cascade started by a Seed node will eventually die out. Including the probability of Seeds not making any recommendation we obtain the probability that a cascade dies out,

$$q_0 = 1 - \lambda_0 + \lambda_0 q_1, \tag{22}$$

and using the results for $q_1$ we get that $q_0 = 1$ below the "tipping-point" and $q_0 < 1$ above the "tipping-point". For our campaigns Seeds are active by definition, which means that $\lambda_0 = 1$ and $q_0 = q_1 = 1$ in all cases.
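The root condition for $q_1$ can be explored numerically. The following is a minimal sketch, assuming a hypothetical fanout distribution (uniform on 1..5) and two hypothetical transmissibilities, one on each side of the "tipping-point". It iterates $q \leftarrow f_1(q)$ from $q = 0$, which converges monotonically to the smallest root in $[0, 1]$; the function names are ours.

```python
def offspring_pgf(x, lam, p):
    """f(x) = (1 - lam) + lam * sum_r p[r] * x**r, the pgf of the thinned distribution."""
    return (1.0 - lam) + lam * sum(prob * x**r for r, prob in p.items())

def extinction_probability(lam, p, tol=1e-13, max_iter=100_000):
    """Smallest root of q = f(q) in [0, 1], found by iterating q <- f(q) from q = 0."""
    q = 0.0
    for _ in range(max_iter):
        q_next = offspring_pgf(q, lam, p)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q

p_uniform = {r: 0.2 for r in range(1, 6)}            # hypothetical: mean fanout 3

q1_below = extinction_probability(0.1, p_uniform)    # R1 = 0.3 < 1
q1_above = extinction_probability(0.5, p_uniform)    # R1 = 1.5 > 1

lam0 = 1.0                                           # Seeds active by definition
q0_above = 1.0 - lam0 + lam0 * q1_above              # Eq. (22)
```

Below the "tipping-point" the iteration settles at 1, while above it the fixed point lies strictly inside $(0, 1)$, so a finite fraction of branches never dies out.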

Moreover, using the correlation between $\lambda_1$ and $\bar r_1$ in Eq. (1) and the condition $R_1 = (\lambda_1)_c\, \bar r_1 = 1$, one can estimate the critical viral transmissibility $(\lambda_1)_c$ required for the viral message to percolate through a large fraction of the entire network. We obtained $(\lambda_1)_c = 0.19$, which corresponds to $(\bar r_1)_c = 5.27$. Of course this is an upper limit to the real "tipping-point", since it is based on the assumption that cascades originating from different Seeds do not merge as the propagation progresses, which is only valid far from the "tipping-point". The low average number of recommendations needed to attain the "tipping-point" illustrates the limited effect of the social network topology on the efficiency of viral campaigns: it is not necessary to forward the message to every social contact of each participant in order to reach a significant fraction of the network population. Fig. 6 shows the estimation of the message propagation "tipping-point" of our campaigns based on such findings. While both $\lambda_1$ and $\bar r_1$ vary with the market where the campaign ran (see Table I), we found that $R_1 < 1$ for all cases, i.e. the viral propagation did not reach the "tipping-point".
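The critical values quoted above follow from a quadratic equation. As a quick check, and assuming only the linear correlation $\bar r_1 = 1 + 22.48\lambda_1$ reported with Fig. 6, the condition $\lambda_1 \bar r_1 = 1$ becomes $22.48\lambda_1^2 + \lambda_1 - 1 = 0$, whose positive root is computed below (the function name is ours).

```python
import math

def critical_transmissibility(a, b):
    """Positive root of lam * (a + b * lam) = 1, i.e. of b*lam**2 + a*lam - 1 = 0."""
    return (-a + math.sqrt(a * a + 4.0 * b)) / (2.0 * b)

lam_c = critical_transmissibility(1.0, 22.48)   # coefficients of the fitted correlation
r1_c = 1.0 + 22.48 * lam_c                      # fanout at the critical point
```

Rounded to two decimals this gives 0.19 for the critical transmissibility and 5.27 for the critical fanout, in agreement with the values in the text.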



B. Asymptotic properties 

As we have seen, below the "tipping-point" $q_1 = 1$, that is, all viral cascades die out eventually. This means that there must exist an asymptotic distribution for the size of the cascades, $\Phi(s,\infty) = \lim_{t\to\infty}\Phi(s,t)$, which is the solution of equations (18) and (19) in the limit $t \to \infty$:

$$\Phi_0(s,\infty) = s f_0[\Phi_1(s,\infty)], \tag{23}$$

$$\Phi_1(s,\infty) = s f_1[\Phi_1(s,\infty)]. \tag{24}$$



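These equations were obtained previously by Newman [13]. In particular we can obtain the average and variance of the cascade size by using $\langle S_0(\infty)\rangle = \Phi_0'(1,\infty)$ and $\mathrm{Var}[S_0(\infty)] = \Phi_0''(1,\infty) + \Phi_0'(1,\infty) - [\Phi_0'(1,\infty)]^2$, where primes denote derivatives with respect to $s$, to get

$$\langle S_0(\infty)\rangle = 1 + \frac{R_0}{1 - R_1}, \tag{25}$$

$$\mathrm{Var}[S_0(\infty)] = \frac{\sigma_0^2}{(1 - R_1)^2} + \frac{R_0\,\sigma_1^2}{(1 - R_1)^3}, \tag{26}$$

where $\sigma_1^2$ is the variance of the number of recommendations of Viral nodes, defined as in Eq. (8) with $f_1$ in place of $f_0$.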
(26) 



As expected, when we approach the "tipping-point", $R_1 \to 1$, the average and variance of the cascade size diverge. With $\lambda_0 = 1$ in Eq. (25) we get the following expression for the average cascade size at infinite time,

$$s^{*} = 1 + \frac{\bar r_0}{1 - R_1}, \qquad 0 < R_1 < 1, \tag{27}$$

which, using the parameters for all markets in Table I, estimates the average cascade size in our campaigns as $s^{*} = 4.4$, very close to the observed value ($s = 4.34$). Not only are average cascade sizes well predicted by the branching model, but their distribution, which can be obtained from the derivatives of $\Phi_0(s,\infty)$ [13], is properly replicated as well when the heterogeneity in the number of recommendations is implemented (see Fig. 3). Both results show how accurate the model is in predicting the reach of a viral marketing campaign by merely using its dynamic parameters. Moreover, since the values of $\lambda_1$ and $\bar r_0$, $\bar r_1$ can be roughly estimated at the early stages of a campaign, we could have predicted its final reach at the very beginning.
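The prediction for the average cascade size below the "tipping-point" can be compared with a direct simulation of the branching process. The sketch below is a toy Monte Carlo under stated assumptions: Seeds always recommend, the fanout distributions are hypothetical uniform choices, and the transmissibility 0.1 is illustrative rather than measured. Response times are ignored because only the final size is needed.

```python
import random

def simulate_cascade_size(rng, lam1, seed_choices, viral_choices):
    """Total number of nodes (Seed included) reached by one cascade."""
    size = 1
    # Messages still to be delivered; the Seed is active by definition (lam0 = 1).
    pending = rng.choice(seed_choices)
    while pending > 0:
        pending -= 1
        size += 1                          # the recipient joins the cascade
        if rng.random() < lam1:            # the recipient becomes a Viral node
            pending += rng.choice(viral_choices)
    return size

rng = random.Random(12345)
seed_choices = [1, 2, 3]           # hypothetical Seed fanout, mean 2
viral_choices = [1, 2, 3, 4, 5]    # hypothetical Viral fanout, mean 3
lam1 = 0.1                         # hypothetical transmissibility
R0, R1 = 2.0, lam1 * 3.0

n_cascades = 100_000
mean_size = sum(
    simulate_cascade_size(rng, lam1, seed_choices, viral_choices)
    for _ in range(n_cascades)
) / n_cascades
predicted = 1.0 + R0 / (1.0 - R1)  # average size at infinite time with lam0 = 1
```

Because cascades are simulated as independent trees, the comparison only tests the branching approximation itself, not the effect of merging cascades on a real network.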



C. Time dynamics 

In the previous subsection we concentrated on the properties of the cascades in the asymptotic regime. Here we come back to the original equations for the dynamics of the nodes, (11) and (13), to investigate their time dependence. Using in them that

$$i_{0,1}(t) = \langle I_{0,1}(t)\rangle = \left.\frac{\partial F_{0,1}(s,t)}{\partial s}\right|_{s=1} \tag{28}$$

we get

$$i_0(t) = 1 - G_0(t) + R_0\int_0^t dG_0(\tau)\, i_1(t - \tau), \tag{29}$$

$$i_1(t) = 1 - G_1(t) + R_1\int_0^t dG_1(\tau)\, i_1(t - \tau), \tag{30}$$

for the dynamics of the average number of infected participants.
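Although no general closed form is available, the renewal equation for $i_1(t)$ is easy to integrate numerically for any response-time cdf. The sketch below is one possible discretization, not the authors' procedure: it uses exact increments of $G$ on a uniform grid and averages $i_1$ over each grid interval (a trapezoid-like rule). The exponential cdf with rate 1 and $R_1 = 0.5$ are hypothetical test values, chosen because the exact solution $e^{\rho(R_1 - 1)t}$ is then known.

```python
import math

def solve_mean_active(G, R1, t_max, n_steps):
    """Solve i1(t) = 1 - G(t) + R1 * int_0^t dG(tau) i1(t - tau) on a uniform grid."""
    dt = t_max / n_steps
    Gs = [G(k * dt) for k in range(n_steps + 1)]
    # dG[m - 1] is the probability mass of G in the interval (t_{m-1}, t_m].
    dG = [Gs[m] - Gs[m - 1] for m in range(1, n_steps + 1)]
    i1 = [1.0 - Gs[0]]
    for n in range(1, n_steps + 1):
        # First interval: the half involving the unknown i1[n] is moved to the left side.
        acc = 0.5 * dG[0] * i1[n - 1]
        for m in range(2, n + 1):
            acc += 0.5 * dG[m - 1] * (i1[n - m] + i1[n - m + 1])
        i1.append((1.0 - Gs[n] + R1 * acc) / (1.0 - 0.5 * R1 * dG[0]))
    return i1

rho, R1 = 1.0, 0.5                 # hypothetical rate and reproductive number
t_max, n_steps = 4.0, 400
i1 = solve_mean_active(lambda t: 1.0 - math.exp(-rho * t), R1, t_max, n_steps)
exact = math.exp(rho * (R1 - 1.0) * t_max)   # known solution for exponential G
```

The same function accepts any other cdf, for instance a log-normal one, at a cost that grows quadratically with the number of grid points.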

Once again, the equation for $i_0(t)$ depends on the solution of the integral equation for $i_1(t)$. Actually, for $G_0(t) = G_1(t) = G(t)$ we could explicitly write $i_0(t) = [1 - G(t)] + (R_0/R_1)[i_1(t) - 1 + G(t)]$. However, the solution for $i_1(t)$ is not known in general, although we can study its asymptotic behavior using Renewal Theory [55]. Such behavior strongly depends on whether or not the so-called Malthusian parameter $\alpha(\gamma, G)$ exists ([50], p. 142), i.e. the real solution of the equation

$$\gamma\int_0^\infty e^{-\alpha t}\, dG(t) = 1. \tag{31}$$



If this parameter $\alpha = \alpha(\gamma, G)$ exists for $\gamma = R_1$, then $i_1(t)$ behaves asymptotically like

$$i_1(t) \sim C e^{\alpha t}, \qquad C = \frac{R_1 - 1}{\alpha R_1^2 \int_0^\infty t\, e^{-\alpha t}\, dG(t)} \tag{32}$$

for all values of $R_1$. Although $\alpha(\gamma, G)$ always exists above the "tipping-point", where $\gamma > 1$, there is a large class of distributions $G(t)$ for which $\alpha(\gamma, G)$ does not exist when $0 < \gamma < 1$. This is the so-called sub-exponential class, which consists of all distribution functions $G(t)$ such that

$$\lim_{t\to\infty} \frac{1 - G^{*2}(t)}{1 - G(t)} = 2, \tag{33}$$



where $G^{*2}(t)$ is the twofold convolution of $G(t)$ [51]. All those distributions have tails that decay more slowly than any exponential; that is, they are heavy-tailed distributions, which is the best qualitative description of the sub-exponential class. Examples of such $G(t)$ are power-law (Pareto-like), stretched exponential or log-normal distributions. For this class of distributions, the asymptotic behavior of $i_1(t)$ is given instead by the tail of the distribution,

$$i_1(t) \simeq \frac{1 - G(t)}{1 - R_1}. \tag{34}$$


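The asymptotic regime is reached for values of $t$ such that $1 - G(t) < 1 - R_1$ or, equivalently, when $G(t) > R_1$. For the cascade size we get from equation (19) that

$$\langle S_1(t)\rangle = 1 + R_1\int_0^t \langle S_1(t - \tau)\rangle\, dG_1(\tau), \tag{35}$$

whose asymptotic behavior, analyzed using Renewal Theory, gives

$$\langle S_1(t)\rangle \simeq \begin{cases} \langle S_1(\infty)\rangle - \dfrac{R_1}{1 - R_1}\, i_1(t) & \text{if } R_1 < 1 \\ \dfrac{R_1}{R_1 - 1}\, i_1(t) & \text{if } R_1 > 1 \end{cases} \tag{36}$$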



IV. EXAMPLES 

In this section we illustrate two kinds of behavior that we can find in the time dynamics of the viral cascades. Specifically, we consider the case in which $G(t)$ is super-exponential with two significant examples, the Poisson process and the Gamma process, and the case in which $G(t)$ is sub-exponential, with application to the log-normal distribution found in Section IV B.



A. Super-exponential processes 

When $G(t)$ is not sub-exponential the Malthusian parameter given by Eq. (31) always exists and the asymptotic solution is given by equation (32).



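Poisson process: Most of the literature assumes that $G(t)$ is the cdf of the exponential distribution for the response times. Thus, if $G_{0,1}(t) = 1 - e^{-\rho_{0,1} t}$, equations (13) and (11) can be differentiated once with respect to time to obtain

$$\frac{\partial F_0(s,t)}{\partial t} = \rho_0\{f_0[F_1(s,t)] - F_0(s,t)\}, \tag{37}$$

$$\frac{\partial F_1(s,t)}{\partial t} = \rho_1\{f_1[F_1(s,t)] - F_1(s,t)\}, \tag{38}$$

and for the moments

$$\frac{d i_0}{dt} = \rho_0[R_0\, i_1(t) - i_0(t)], \tag{39}$$

$$\frac{d i_1}{dt} = \rho_1[R_1 - 1]\, i_1(t). \tag{40}$$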



The solution of the second equation with initial condition $i_1(0) = 1$ is $i_1(t) = e^{\alpha_1 t}$ with $\alpha_1 = \rho_1(R_1 - 1)$, and then, with $i_0(0) = 1$,

$$i_0(t) = \begin{cases} \dfrac{R_0\rho_0}{\alpha_1 + \rho_0}\, e^{\alpha_1 t} + \left(1 - \dfrac{R_0\rho_0}{\alpha_1 + \rho_0}\right) e^{-\rho_0 t} & \text{if } \alpha_1 \neq -\rho_0 \\ (1 + R_0\rho_0 t)\, e^{-\rho_0 t} & \text{if } \alpha_1 = -\rho_0 \end{cases} \tag{41}$$

where

$$\alpha_1 = \rho_1(R_1 - 1) = \frac{R_1 - 1}{\bar\tau_1} \tag{42}$$

is the Malthusian parameter for $I_1(t)$. The resonant case $\alpha_1 = -\rho_0$ can only happen below the "tipping-point", where $\alpha_1 < 0$. Equations (39) and (40) are the linear growth Markovian models typically used to understand the dynamics of information spreading in social networks [29]. In particular, if the number of recommendations depends linearly on the connectivity of the substrate social network, then $p_{1,r} \sim k p_k/\bar k$ and thus $R_1 = \lambda\,\overline{k^2}/\bar k$, which recovers the result by Pastor-Satorras and Vespignani [34] that the Malthusian parameter is

$$\alpha_1 = \left(\lambda\,\frac{\overline{k^2}}{\bar k} - 1\right)\rho. \tag{43}$$

Thus, if the social connectivity has a fat-tailed distribution, then $\overline{k^2} \gg \bar k$ and $\lambda_c \simeq 0$. Moreover, we recover the result of [56] that the Malthusian parameter in that case is $\alpha_1 \gg 1$, which leads to an exploding exponential that grows very fast in a short time.

The Poisson process is special, since $\alpha_1$ depends linearly on $R_1$. Thus the value of $R_1$ for social networks influences not only the total reach of the cascades but also the time dynamics. However, this is not always the case, as we will see for other time processes. Besides, the Poissonian case tells us that the time dynamics of viral cascades is Markovian and that human dynamics can be described by differential equations like (39).
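The Markovian equations for the mean numbers of active nodes form a linear system that can be integrated directly. The sketch below uses a hand-written fourth-order Runge-Kutta step, with hypothetical rates and reproductive numbers, and compares the result with the closed-form solution of the same system for initial conditions $i_0(0) = i_1(0) = 1$ in the non-resonant case $\alpha_1 \neq -\rho_0$. Function and parameter names are ours.

```python
import math

def rk4_mean_dynamics(rho0, rho1, R0, R1, t_end, n_steps):
    """Integrate di0/dt = rho0*(R0*i1 - i0), di1/dt = rho1*(R1 - 1)*i1 from i0 = i1 = 1."""
    def rhs(i0, i1):
        return rho0 * (R0 * i1 - i0), rho1 * (R1 - 1.0) * i1

    h = t_end / n_steps
    i0, i1 = 1.0, 1.0
    for _ in range(n_steps):
        k1 = rhs(i0, i1)
        k2 = rhs(i0 + 0.5 * h * k1[0], i1 + 0.5 * h * k1[1])
        k3 = rhs(i0 + 0.5 * h * k2[0], i1 + 0.5 * h * k2[1])
        k4 = rhs(i0 + h * k3[0], i1 + h * k3[1])
        i0 += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        i1 += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return i0, i1

def closed_form_i0(rho0, rho1, R0, R1, t):
    """Exact i0(t) with i0(0) = 1, valid when alpha1 differs from -rho0."""
    alpha1 = rho1 * (R1 - 1.0)
    a = R0 * rho0 / (alpha1 + rho0)
    return a * math.exp(alpha1 * t) + (1.0 - a) * math.exp(-rho0 * t)

params = dict(rho0=1.0, rho1=0.5, R0=2.0, R1=0.4)    # hypothetical values
i0_num, i1_num = rk4_mean_dynamics(t_end=5.0, n_steps=500, **params)
i0_exact = closed_form_i0(t=5.0, **params)
i1_exact = math.exp(0.5 * (0.4 - 1.0) * 5.0)         # exp(alpha1 * t)
```

At long times the decay of both averages is governed by the same rate $\alpha_1$, since the transient term in $e^{-\rho_0 t}$ dies out faster for these parameters.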

Gamma process: In the case in which the distribution of response times is not given by an exponential, the behavior below the "tipping-point" is given by Eq. (32) for distributions $G(t)$ not in the sub-exponential class. Above the "tipping-point" the Malthusian parameter $\alpha$ always exists, but its relationship with $R_1$ can be highly non-linear. For example, in many applications it is found that the response time distribution $G(\tau)$ can be fitted to the cdf of the gamma distribution [57, 58], whose pdf is

$$P(\tau_1) = \frac{\tau_1^{k-1}\, e^{-\tau_1/\theta}}{\theta^k\, \Gamma(k)}, \tag{44}$$

where $\bar\tau_1 = k\theta$ and $\mathrm{Var}(\tau_1) = k\theta^2$. In fact, in [58] Vazquez et al. found that the email response time is distributed as (44) with $k \simeq 0$ and $\theta \simeq 20$ days. On the other hand, the gamma distribution is used as a simple model for the response time or lifetime since it can accommodate different functional behaviors: a delta function when $k \to \infty$ with $k\theta$ fixed, a power law with exponential cutoff when $k < 1$, or the exponential case when $k = 1$, $1/\theta = \rho$. For $k > 0$ and $\theta < \infty$ the gamma distribution does not belong to the sub-exponential class. Thus the Malthusian parameter always exists and, moreover, it can be calculated exactly as

$$\alpha_1 = \frac{R_1^{1/k} - 1}{\theta}. \tag{45}$$



This equation shows the non-trivial entanglement in the time dynamics of the recommendation process between the distribution of recommendations ($R_1$) and the response time distribution ($k, \theta$). In particular, it shows that the exponential growth depends not only on the mean response time $\bar\tau_1$ but also on the variance. To show this, we keep $\bar\tau_1 = k\theta = 1$ fixed and vary $k$ to control the variance. Figure 7 shows that above the "tipping-point" $\alpha_1$ diverges when $\mathrm{Var}(\tau_1)$ grows, and thus propagation happens much more rapidly than in the Poissonian approximation. The reason is that above the "tipping-point" the initial exponential growth of the infinite cascade is triggered by those people with response times below the mean, which in the case of long-tailed distributions are also more abundant than those with large response times. Below the "tipping-point", the contrary happens: since all cascades die out, their time dynamics is controlled by a few nodes who, in the case of long-tailed distributions, can have large response times, halting the branching process and slowing down the propagation of the information. In particular, Eq. (45) recovers the result in [58] that with $k \simeq 0$ and below the "tipping-point" we get $\alpha_1 \simeq -1/\theta$, i.e. the time scale is given by the cutoff in the distribution of response times.

FIG. 7. (Color online) Malthusian parameter $\alpha_1$ for the gamma distribution of response times given by equation (45) above (left, $R_1 > 1$) and below (right, $R_1 < 1$) the "tipping-point". The horizontal line is the Malthusian parameter for the Poissonian approximation with the same average response time, i.e. $\rho = 1$.

However, it is important to note that even in this case the asymptotic dynamics in the limit $t \to \infty$ is still given by the exponential decay in equation (32), which shows that, although $\alpha_1$ now depends non-trivially on the moments of the $G(t)$ distribution, we may describe the dynamics in terms of Markovian equations like (39), replacing $\alpha_1$ by its actual value.
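The dependence of the Malthusian parameter on the variance of gamma-distributed response times can be illustrated with a few numbers. The sketch below assumes the standard Laplace transform of the gamma pdf, $(1 + \alpha\theta)^{-k}$, so that the defining condition $R_1 (1 + \alpha\theta)^{-k} = 1$ can be inverted in closed form; the parameter values are hypothetical and chosen so that the mean response time $k\theta$ equals 1 in every case.

```python
def malthusian_gamma(R1, k, theta):
    """alpha solving R1 * (1 + alpha*theta)**(-k) = 1 for gamma response times."""
    return (R1 ** (1.0 / k) - 1.0) / theta

def laplace_condition(alpha, R1, k, theta):
    """R1 times the Laplace transform of the gamma pdf; equals 1 at the Malthusian alpha."""
    return R1 * (1.0 + alpha * theta) ** (-k)

# Mean response time k*theta = 1 throughout; smaller k means larger variance k*theta**2.
poisson_above = malthusian_gamma(2.0, 1.0, 1.0)    # k = 1 is the exponential case
heavy_above = malthusian_gamma(2.0, 0.1, 10.0)     # variance 10, above the tipping point
poisson_below = malthusian_gamma(0.5, 1.0, 1.0)
heavy_below = malthusian_gamma(0.5, 0.1, 10.0)     # variance 10, below the tipping point
```

With these assumed values, a larger variance makes growth faster above the "tipping-point" and decay slower below it, and in the second case the rate stays close to $-1/\theta$.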

B. Sub-exponential process 

In the case where $G(t)$ is sub-exponential, the Malthusian parameter does not exist below the "tipping-point" and the asymptotic dynamics of the process is given by the tail of the distribution $G(t)$, as in Eq. (34). In particular, this implies that we cannot describe the dynamics of viral cascades by Markovian approximations like the differential equations (39), a sign of the strong non-Markovian character of the process in this situation, which corresponds to our empirical findings.

FIG. 8. (Color online) Distribution of the average fraction of new participants as a function of the cascades start time in our campaigns (circles) compared with Eq. (47) (black line), the prediction of the Bellman-Harris model with $P(t)$ the log-normal distribution of Eq. (46), and with $P(t)$ exponential of the same mean (red). The dashed line is Eq. (48), the asymptotic approximation of the Bellman-Harris model with $P(t)$ log-normal. Inset: Time dynamics of $S(t)$, the cascades average size (circles), accurately predicted by the model for $G(t)$ log-normal. In red, predictions for $G(t)$ exponential.

Log-normal process: We concentrate on the case where $G(t)$ is the cdf of the log-normal distribution, which we found to be a good model for the response time in our campaigns. Specifically, assuming its pdf is

$$P(t) = \frac{1}{t\,\sigma_\tau\sqrt{2\pi}}\exp\left[-\frac{(\ln t - \tilde\tau_1)^2}{2\sigma_\tau^2}\right], \tag{46}$$

with mean $\tilde\tau_1$ and variance $\sigma_\tau^2$ (both referring to $\ln t$), then Eq. (34) tells us that

$$i_1(t) \simeq \frac{1}{2(1 - R_1)}\,\mathrm{Erfc}\left(\frac{\ln t - \tilde\tau_1}{\sqrt{2}\,\sigma_\tau}\right), \tag{47}$$

where $\mathrm{Erfc}(x)$ is the complementary error function. In the large $t$ limit we can replace $\mathrm{Erfc}(x)$ in Eq. (47) by the first term of its asymptotic expansion to obtain

$$i_1(t) \simeq \frac{\sigma_\tau}{(1 - R_1)\sqrt{2\pi}\,(\ln t - \tilde\tau_1)}\exp\left[-\frac{(\ln t - \tilde\tau_1)^2}{2\sigma_\tau^2}\right], \tag{48}$$



which indicates that the decay below the "tipping-point" is not exponential and also that it happens on a logarithmic, not a linear, time scale, as shown in Fig. 8. This in turn implies that the information propagation persists for much longer times than expected, as was shown in [9], since the asymptotic dynamics in dying viral cascades can be dominated (and halted) by a single individual. However, above the "tipping-point" the Malthusian parameter $\alpha_1$ always exists and can be calculated. In this case it can be very different from the Poissonian approximation given by Eq. (42), and, since there is no closed-form analytical expression for the Laplace transform of the log-normal distribution, equation (31) must be solved through numerical methods like the one proposed in [59]. Finally, an important difference with the super-exponential process is that, with sub-exponential cdfs of the response times, Eq. (34) shows an asymptotic dynamics for $i(t)$ that is always universally given by the tail of the cdf of the response times (with a rescaling prefactor dependent on $R_1$). This could be used to measure $G(t)$ if no access to individual responses is possible. Note, however, that in the case of super-exponential distributions this is not possible, since the Malthusian parameter in Eq. (31) depends highly non-trivially on both $G(t)$ and $R_1$.
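The two log-normal expressions for the decay of $i_1(t)$ below the "tipping-point" can be compared numerically. The sketch below evaluates the tail formula written with the complementary error function and its leading large-$t$ term, which follows from $\mathrm{erfc}(x) \approx e^{-x^2}/(x\sqrt{\pi})$. The parameters `R1`, `mu` and `sigma` (the mean and standard deviation of $\ln t$) are hypothetical and are not the values fitted to the campaigns.

```python
import math

def i1_tail(t, R1, mu, sigma):
    """(1 - G(t)) / (1 - R1) for a log-normal cdf G, written with erfc."""
    x = (math.log(t) - mu) / (math.sqrt(2.0) * sigma)
    return 0.5 * math.erfc(x) / (1.0 - R1)

def i1_tail_leading(t, R1, mu, sigma):
    """Leading large-t term, using erfc(x) ~ exp(-x**2) / (x * sqrt(pi))."""
    z = math.log(t) - mu
    gauss = math.exp(-z * z / (2.0 * sigma**2))
    return sigma * gauss / ((1.0 - R1) * math.sqrt(2.0 * math.pi) * z)

R1, mu, sigma = 0.3, 1.5, 1.0      # hypothetical parameters
ratio_early = i1_tail_leading(1e2, R1, mu, sigma) / i1_tail(1e2, R1, mu, sigma)
ratio_late = i1_tail_leading(1e6, R1, mu, sigma) / i1_tail(1e6, R1, mu, sigma)
```

The ratio approaches 1 only slowly, because the correction to the leading term decays like $1/(\ln t)^2$; this is another way to see that the relevant time scale is logarithmic.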



V. CONCLUSIONS 

We closely tracked the propagation of an invariable message in an information diffusion process below the "tipping-point" (i.e. with $R_1 < 1$) driven by a real viral marketing mechanism run in several European markets. Our analysis of the data set of the resulting propagation, which reached over 31,000 individuals, reveals striking diffusion patterns that are substantially different from those of the epidemic models traditionally used to explain information propagation.

Those characteristic patterns affect both the structure of the propagation paths and their dynamics. On the structural side, the viral propagation cascades are nearly pure trees, almost completely devoid of closed loops or cycles, and feature a very low clustering coefficient, almost two orders of magnitude lower than the one typical of the email social networks upon which the viral propagation took place. Besides, the recommendation-spreading activity of the campaigns' active participants is very heterogeneous and its pdf is a long-tailed power law, which explains why most of the observed propagation was due to extraordinary events caused by super-spreading individuals. On the dynamics side, the propagation process shows that a majority of the spreading individuals become inactive right after sending their recommendations in what could be considered a "birth-and-death" process. Finally, the pdf of the forwarding time for the received recommendations is also a very heterogeneous long-tailed distribution, a log-normal in this case, and the spreaders' forwarding time distribution and that of the number of recommendations they sent are independent and uncorrelated.

While there exist in the literature a number of studies about the static properties of viral information diffusion, none of them explains the peculiar features discovered in the dynamics of our real campaigns. On the one hand, most models concentrate only on the static asymptotic properties of the viral dynamics, like Jurvetson's Viral Marketing model [39], the marketing percolation model of Goldenberg and Libai [60], or the recommendation propagation model by Leskovec et al. [20], which predicts a power law with exponent $\gamma = -1$ for the distribution of the number of recommendations. On the other hand, numerous authors have studied the dynamics of stochastic rumors [17, 19, 61, 62] using the Daley-Kendall (DK) or the Susceptible-Infective-Refractory (SIR) propagation models with Markovian differential equations, or the elaborate branching model of van der Lans et al. [52]. However, those models assume that the response time can be described by an exponential distribution, which facilitates the theoretical analysis since the process is then Markovian and thus viral information diffusion can be explained by differential equations.

As we have found, this is not the case for our real experiments, and we have described how to model the dynamics of information diffusion by means of the Bellman-Harris model, which is the minimal framework to understand the non-Markovian spreading of information on social networks. This model generalizes the branching Galton-Watson scheme typically used both in information diffusion [9, 14, 15, 52] and general percolation processes in social networks [13, 34]. Our main result is that the information diffusion process studied in this research shows a branching dynamics with some striking peculiarities that result a) from the characteristic human patterns when scheduling and prioritizing tasks, b) from the human decisions on how to select targets for the viral propagation, and c) from the negligible influence of the substrate social network when the process runs below the "tipping-point". Thus, to explain all of them we propose a concise model that considers the large heterogeneity of human behavior but neglects the impact of the email social network underlying the diffusion process. The mathematical description of this approach is a non-Markovian Bellman-Harris branching model with a sub-exponential (log-normal) distribution of the recommendations response time $G(t)$, like the one in Section IV B, and two different power-law distributions for the number of referrals for the classes of Seed and Viral nodes, $p_{0,r}$ and $p_{1,r}$ respectively. Since $r_i$ and $\tau_i$ in our model are both iid random variables, the overall a priori probability of transmission of the information between two individuals, the Transmissibility $\lambda_1$, is the average over the distributions $p_{1,r}$ and $G_1(t)$ of the transmission probability between any two individuals [13]. Thus, per Newman [35], our branching model is equivalent to uniform bond percolation on the same social network, and several magnitudes of interest (cascade size distribution and "tipping-point") in the infinite time limit can be obtained by mapping it onto a bond percolation model.

Given the distributions $p_{0,r}$, $p_{1,r}$, $G_0(t)$ and $G_1(t)$, this model accurately predicts all the magnitudes of interest of the viral information or WOM diffusion processes: the dynamic parameters Transmissibility and Fanout coefficient, the cascade size distribution, its average and variance in the asymptotic limit, the clustering coefficient of the cascades network, the message propagation "tipping-point" and the precise time dynamics in the asymptotic regime. Besides, it allows predictions for processes past, but close to, the "tipping-point", provided the substrate network of the propagation is large enough to avoid finite-size effects and maintain the assumption of its negligible influence. The accuracy of those predictions, which can be achieved early in the propagation process, makes this model a valuable tool for managing information diffusion. Finally, since most information transmission, sharing and searching in social networks has limited reach (thus happening below the "tipping-point") and given the fact that there seems to exist a certain universality in both the heterogeneity in the number of actions [22-28] and the sub-exponential character of human response times [9, 22, 30-33], our theoretical model is thus the most basic and general analytical tool to understand processes like rumor spreading, cooperation, opinion formation, cultural dynamics, diffusion of innovations, etc.

FIG. 9. Clustering coefficient $C_{\mathrm{cas}}$ for the cascades network obtained through simulations of the viral propagation model on a real email network (circles) with $C_{\mathrm{soc}} = 0.22$ and $\langle k_{nn}\rangle = 18.9$, compared with the linear relationship in Eq. (3).



Appendix A: Clustering coefficients correlation 

Assuming independence between the degree of a social network node and the number of messages it sends in a diffusion process, the undirected Clustering coefficients of the social network $C_{\mathrm{soc}}$ and of the cascades network $C_{\mathrm{cas}}$ are correlated. Both are defined as [35]

$$C = \frac{3 \times \text{number of triangles in the network}}{\text{number of connected triples in the network}}, \tag{A1}$$

where "triple" means a node with two edges running to an unordered pair of others. If connected, such a pair forms a triangle. In a mean-field approximation they can be estimated as

$$C_{\mathrm{soc}} = \frac{3 \times \langle \mathrm{triang}_{\mathrm{soc}}\rangle}{\langle \mathrm{tripl}_{\mathrm{soc}}\rangle}, \tag{A2}$$

$$C_{\mathrm{cas}} = \frac{3 \times \langle \mathrm{triang}_{\mathrm{cas}}\rangle}{\langle \mathrm{tripl}_{\mathrm{cas}}\rangle}, \tag{A3}$$


with $\langle \mathrm{triang}\rangle$ and $\langle \mathrm{tripl}\rangle$ being the average number of triangles or triples per node in the social (soc) or cascades (cas) network. The probability of finding a triangle on a given node is the probability of it having a triple times the linking probability of its end nodes,

$$P(\mathrm{triang}) = P(\mathrm{tripl}) \times P(\mathrm{close}), \tag{A4}$$

where $P(\mathrm{close})$ is the probability that a link exists on the open side of the triad. Due to the independence of social links and recommendations, the average numbers of triangles and triples in the cascades network are

$$\langle \mathrm{triang}_{\mathrm{cas}}\rangle = P(\mathrm{tripl}) \times P(\mathrm{close}) \times \langle \mathrm{triang}_{\mathrm{soc}}\rangle, \tag{A5}$$

$$\langle \mathrm{tripl}_{\mathrm{cas}}\rangle = P(\mathrm{tripl}) \times \langle \mathrm{tripl}_{\mathrm{soc}}\rangle, \tag{A6}$$

which, substituted in (A2) and (A3) and combined with (A4), yield

$$C_{\mathrm{cas}} = P(\mathrm{close}) \times C_{\mathrm{soc}}. \tag{A7}$$

Since nodes reached by a viral message become active with probability $\lambda_1$ and each resends it on average to $\bar r_1$ of its $\langle k_{nn}\rangle - 1$ nearest neighbors (excluding the ancestor node), the probability of closing the triple is

$$P(\mathrm{close}) = \frac{2\lambda_1 \bar r_1}{\langle k_{nn}\rangle - 1} = \frac{2R_1}{\langle k_{nn}\rangle - 1}, \qquad R_1 < 1, \tag{A8}$$

whose factor 2 stems from the fact that either of the nodes at the open end of a triple can send the message and close the triangle. Replacing $P(\mathrm{close})$ in (A7) recovers Eq. (3), which has been verified (even for $R_1 > 1$) through simulations on a university email network [63] with $C_{\mathrm{soc}} \simeq 0.22$. Its correlation with the cascades network Clustering coefficient as a function of $R_1$ is shown in Fig. 9. The low values of $C_{\mathrm{cas}}$ explain why our model neglects the substrate network structure in the study of information propagation below the "tipping-point".
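To give a sense of scale, the mean-field estimate of the cascades clustering coefficient can be evaluated with the network values quoted for Fig. 9. The sketch below assumes that the closing probability is $2R_1/(\langle k_{nn}\rangle - 1)$, i.e. that the denominator counts the nearest neighbors excluding the ancestor as described in the text; the two values of $R_1$ are illustrative and the function name is ours.

```python
def cascade_clustering(R1, c_soc, k_nn):
    """C_cas = P(close) * C_soc, assuming P(close) = 2 * R1 / (k_nn - 1)."""
    p_close = 2.0 * R1 / (k_nn - 1.0)
    return p_close * c_soc

c_soc, k_nn = 0.22, 18.9           # values quoted for the email network of Fig. 9
c_cas_half = cascade_clustering(0.5, c_soc, k_nn)      # well below the tipping point
c_cas_tipping = cascade_clustering(1.0, c_soc, k_nn)   # at R1 = 1
```

Under this assumption the cascades clustering coefficient stays below 0.03 for every $R_1 \leq 1$, roughly an order of magnitude smaller than that of the substrate network.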